library(vegan)
library(qiime2R)
library(dendextend)
library(cluster)
library(dplyr)
library(corrplot)
library(ggpubr)
library(Hmisc)
library(phyloseq)
library(ggplot2)
library(plotly)
library(DT)
phyloseq Demo Data
For demonstration purposes, we will use a dataset available in the
phyloseq package. The table consists of 19,216
features (OTUs) in rows and 26 samples in columns. Note that this table
is not yet rarefied; we demonstrate below how to rarefy a
feature table in R.
data("GlobalPatterns")
feature_table <- as.data.frame(otu_table(GlobalPatterns))
str(feature_table)
## 'data.frame': 19216 obs. of 26 variables:
## $ CL3 : num 0 0 0 0 0 0 7 0 153 3 ...
## $ CC1 : num 0 0 0 0 0 0 1 0 194 5 ...
## $ SV1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M31Fcsw : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M11Fcsw : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M31Plmr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M11Plmr : num 0 0 1 0 0 0 3 0 0 0 ...
## $ F21Plmr : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M31Tong : num 0 0 0 0 0 0 0 0 0 0 ...
## $ M11Tong : num 0 0 0 0 0 0 0 0 0 0 ...
## $ LMEpi24M: num 0 0 0 0 0 0 0 0 0 0 ...
## $ SLEpi20M: num 1 0 0 0 0 0 0 0 0 0 ...
## $ AQC1cm : num 27 0 0 0 0 0 0 0 1 0 ...
## $ AQC4cm : num 100 2 0 22 2 1 0 1 0 0 ...
## $ AQC7cm : num 130 6 0 29 1 3 0 0 1 0 ...
## $ NP2 : num 1 0 0 0 0 0 0 0 0 0 ...
## $ NP3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NP5 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TRRsed1 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TRRsed2 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TRRsed3 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TS28 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ TS29 : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Even1 : num 0 0 0 0 0 0 0 0 2 0 ...
## $ Even2 : num 0 0 0 0 0 0 0 0 1 0 ...
## $ Even3 : num 0 0 0 0 0 0 0 0 0 0 ...
Additionally, we will also make use of the metadata file associated with the samples.
metadata <- as(sample_data(GlobalPatterns), 'data.frame')
names(metadata)[1] <- 'Samples' # Convert column name from X.SampleID to Samples
str(metadata)
## 'data.frame': 26 obs. of 7 variables:
## $ Samples : Factor w/ 26 levels "AQC1cm","AQC4cm",..: 5 4 21 14 11 15 12 9 16 13 ...
## $ Primer : Factor w/ 26 levels "ILBC_01","ILBC_02",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Final_Barcode : Factor w/ 26 levels "AACGCA","AACTCG",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Barcode_truncated_plus_T: Factor w/ 26 levels "AACTGT","ACAGGT",..: 24 13 3 20 9 5 17 7 19 11 ...
## $ Barcode_full_length : Factor w/ 26 levels "AGAGAGACAGG",..: 10 6 18 21 8 9 16 19 26 20 ...
## $ SampleType : Factor w/ 9 levels "Feces","Freshwater",..: 8 8 8 1 1 7 7 7 9 9 ...
## $ Description : Factor w/ 25 levels "Allequash Creek, 0-1cm depth",..: 4 5 20 14 11 15 12 9 16 13 ...
If you want to use your own data, some ways to load your feature table as a dataframe in R are outlined below.
First, make sure your working directory points to the folder containing your files, e.g. by using the
setwd() function.
You can directly load your table from a .qza file (type
FeatureTable[Frequency]) into an R dataframe using the package
qiime2R. Note that read_qza() returns a list, with
the table itself stored in the $data element.
feature_table <- as.data.frame(read_qza('feature-table.qza')$data)
Alternatively, if you have a tab-separated file, you can use the
read.table() function. We assume here that your
feature-table.tsv file has (1) sample names in row 1 and
(2) feature IDs in column 1. If not, you can format your table in
Excel or adjust the code below.
feature_table <- read.table('feature-table.tsv', sep='\t', header=T, row.names=1)
Rarefaction is simply the process of randomly subsampling the data: we randomly select N reads from each sample so that every sample ends up with the same sampling depth. The motivation is that different samples often have varying sequencing depths, which can bias comparisons of diversity metrics across samples. By rarefying to a uniform depth, we mitigate the impact of sequencing-depth variability, allowing a fairer comparison of diversity metrics between samples.
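To make the mechanics concrete, here is a minimal base-R sketch of rarefying a single sample (the toy counts below are made up for illustration; vegan's rrarefy() does essentially this for every row of a samples-by-features matrix at once):

```r
# Toy sample: read counts for three features (100 reads total)
counts <- c(f1 = 50, f2 = 30, f3 = 20)

# Rarefy to a depth of 10: expand counts into one entry per read,
# draw 10 reads without replacement, then re-tabulate
set.seed(42)
reads <- rep(names(counts), counts)
rarefied <- table(factor(sample(reads, size = 10), levels = names(counts)))
sum(rarefied)  # always equals the chosen depth (10)
```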
Below, we rarefy the feature table to the minimum sample depth, similar to the workflow
in Diversity Analysis in QIIME2.ipynb. Note that rrarefy() expects samples in rows, so we transpose the table first.
datatable(
t(rare_ft <- as.data.frame(rrarefy(t(feature_table), sample=min(colSums(feature_table))))),
options=list(scrollX=T)
)
Many functions related to ecological statistics are available in the
package vegan. In the sub-sections below, we will
demonstrate how to calculate various diversity indices.
Some alpha diversity indices can be calculated using the
diversity() function. By default, diversity()
uses the Shannon index, but you can change this via
the index parameter. Meanwhile, diversity indices that
attempt to extrapolate species richness (e.g. Chao1,
ACE) can be obtained using the estimateR()
function.
Shannon Index
(shannon <- diversity(rare_ft))
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
## 6.532907 6.727936 6.453076 3.820125 3.270314 4.279470 4.822593 4.855435
## M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2
## 2.649999 3.892083 3.089139 3.633438 3.497326 3.345948 3.973473 4.208867
## NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
## 4.476561 4.540608 6.157462 4.834081 5.426870 4.115519 3.438758 4.064345
## Even2 Even3
## 3.946545 3.990425
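To make the formula concrete, the Shannon index can be computed by hand in base R (toy counts, purely illustrative; vegan's diversity() uses the natural logarithm by default, as below):

```r
# Shannon index: H' = -sum(p_i * log(p_i)), where p_i are relative abundances
counts <- c(10, 10, 10, 10)
p <- counts / sum(counts)
H <- -sum(p * log(p))
H  # equals log(4) for four equally abundant features
```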
Simpson Index
(simpson <- diversity(rare_ft, index='simpson'))
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
## 0.9946957 0.9952135 0.9962438 0.9276781 0.9089478 0.9389035 0.9518425 0.9779817
## M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2
## 0.8607580 0.9355877 0.8041046 0.9077420 0.7622558 0.7406628 0.8156339 0.9530782
## NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
## 0.9719659 0.9742975 0.9924388 0.9640802 0.9814767 0.9650703 0.9181868 0.9683092
## Even2 Even3
## 0.9637993 0.9671628
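Similarly, the Gini-Simpson index returned by diversity(..., index='simpson') is 1 minus the sum of squared relative abundances; a hand computation on toy counts:

```r
# Gini-Simpson index: 1 - sum(p_i^2)
counts <- c(10, 10, 10, 10)
p <- counts / sum(counts)
simp <- 1 - sum(p^2)
simp  # 0.75 for four equally abundant features
```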
Chao1
(chao1 <- estimateR(rare_ft)['S.chao1', ])
## CL3 CC1 SV1 M31Fcsw M11Fcsw M31Plmr M11Plmr F21Plmr
## 4756.1326 4988.9222 3912.2528 1216.0851 948.3152 2001.4928 3252.4121 2718.5778
## M31Tong M11Tong LMEpi24M SLEpi20M AQC1cm AQC4cm AQC7cm NP2
## 989.4783 2947.2521 1420.3462 1913.6321 3590.0556 3088.9110 3453.7548 1835.5610
## NP3 NP5 TRRsed1 TRRsed2 TRRsed3 TS28 TS29 Even1
## 1935.4903 1518.8942 4510.2748 3468.7245 3935.5298 1503.5283 1120.1031 2007.5769
## Even2 Even3
## 1749.8000 1500.8000
Below, we visually compare the Shannon diversity index
of samples across the metadata variable SampleType.
First, we merge the per-sample alpha diversity values with the metadata table.
datatable(
alpha_div_df <- as.data.frame(shannon) %>% tibble::rownames_to_column('Samples'), options=list()
)
sample_group_div <- merge(metadata, alpha_div_df, by='Samples')
str(sample_group_div)
## 'data.frame': 26 obs. of 8 variables:
## $ Samples : Factor w/ 26 levels "AQC1cm","AQC4cm",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Primer : Factor w/ 26 levels "ILBC_01","ILBC_02",..: 13 14 15 2 1 24 25 26 8 11 ...
## $ Final_Barcode : Factor w/ 26 levels "AACGCA","AACTCG",..: 13 14 15 2 1 24 25 26 8 11 ...
## $ Barcode_truncated_plus_T: Factor w/ 26 levels "AACTGT","ACAGGT",..: 25 6 8 13 24 23 12 2 7 10 ...
## $ Barcode_full_length : Factor w/ 26 levels "AGAGAGACAGG",..: 13 4 3 6 10 24 23 1 19 7 ...
## $ SampleType : Factor w/ 9 levels "Feces","Freshwater",..: 3 3 3 8 8 4 4 4 7 2 ...
## $ Description : Factor w/ 25 levels "Allequash Creek, 0-1cm depth",..: 1 2 3 5 4 6 7 8 9 10 ...
## $ shannon : num 3.5 3.35 3.97 6.73 6.53 ...
Now, we create boxplots to visualize the distribution of
shannon diversity across the sample groups.
alpha_div_boxplot <- ggplot(sample_group_div, aes(x=shannon, y=SampleType, fill=SampleType)) +
geom_boxplot() +
geom_jitter(height=0.2, size=0.5) +
theme_bw() +
xlab('Shannon Diversity') +
ylab('Sample Type')
alpha_div_boxplot
In QIIME2, the Kruskal-Wallis test is used to test whether sample groups differ in alpha diversity. It is essentially the non-parametric counterpart of one-way ANOVA: instead of comparing means, the Kruskal-Wallis test compares rank sums.
We can implement it in R as shown below. The formula
shannon ~ SampleType tells the test to compare the alpha
diversity metric of samples grouped according to
SampleType. In the results below, the p-value is below the
significance level (\(\alpha=0.05\)),
hence, we reject the null hypothesis.
(kw_test <- kruskal.test(shannon ~ SampleType, data=sample_group_div))
##
## Kruskal-Wallis rank sum test
##
## data: shannon by SampleType
## Kruskal-Wallis chi-squared = 22.125, df = 8, p-value = 0.004689
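Because kruskal.test() comes from base R's stats package, you can try it on made-up data before applying it to your own (the values and group names below are hypothetical, chosen so the three groups are perfectly rank-separated):

```r
# Kruskal-Wallis on toy alpha-diversity values in three groups
toy <- data.frame(
  shannon = c(2.1, 2.3, 2.2, 4.0, 4.2, 4.1, 3.0, 3.1, 2.9),
  group   = rep(c("soil", "water", "skin"), each = 3)
)
res <- kruskal.test(shannon ~ group, data = toy)
res$statistic  # rank-based chi-squared statistic
res$p.value    # below 0.05 here: groups differ
```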
Beta diversity indices can be calculated using the
vegdist() function, which returns a matrix of pairwise dissimilarities
between samples; by default it uses the Bray-Curtis index (bray).
Bray-Curtis
bray <- vegdist(rare_ft)
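For intuition, the Bray-Curtis dissimilarity between two samples can be written out by hand (toy abundance vectors, purely illustrative):

```r
# Bray-Curtis: sum(|x_i - y_i|) / sum(x_i + y_i)
x <- c(10, 0, 5)
y <- c(4, 6, 5)
bc <- sum(abs(x - y)) / sum(x + y)
bc  # (6 + 6 + 0) / 30 = 0.4
```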
Euclidean
eucl <- vegdist(rare_ft, method='euclidean')
Jaccard
The Jaccard index is based on presence/absence data, so we set
binary=T.
jacc <- vegdist(rare_ft, method='jaccard', binary=T)
Principal coordinate analysis (PCoA) is a multivariate analysis technique that begins with a distance matrix, typically derived from a beta diversity index. It then projects samples into a lower-dimensional space (commonly 2 or 3 dimensions) to facilitate visualization of potential clusters within the dataset.
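The workhorse here, cmdscale(), is part of base R's stats package, so the idea can be sketched on a tiny hand-made configuration (toy points, purely illustrative):

```r
# Classical MDS / PCoA on a toy Euclidean distance matrix
pts <- matrix(c(0, 0,
                0, 1,
                5, 0,
                5, 1), ncol = 2, byrow = TRUE)
d <- dist(pts)
pcoa_toy <- cmdscale(d, k = 2, eig = TRUE)
dim(pcoa_toy$points)  # 4 samples embedded in 2 dimensions
```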
Below, we use the Bray-Curtis dissimilarity matrix for demonstration.
pcoa <- cmdscale(bray, eig=T)
A simple visualization can be created using the
ordiplot() function.
ordiplot(pcoa, display='sites', type='text')
We can make this prettier by using ggplot(). First, we
collect the coordinates of the points.
datatable(pcoa_points <- as.data.frame(pcoa$points) %>% tibble::rownames_to_column('Samples'))
Then, we merge this with the metadata table.
datatable(
pcoa_w_metadata <- merge(metadata, pcoa_points, by='Samples'),
options=list(scrollX=T)
)
We also calculate the variance explained by the first two principal coordinates (PCs).
var_PCs <- pcoa$eig / sum(pcoa$eig)
(var_PC1 <- var_PCs[1])
## [1] 0.1466622
(var_PC2 <- var_PCs[2])
## [1] 0.1215819
Now, we recreate the plot above, but instead of sample names as
points, we color the points according to SampleType.
pcoa_plot <- ggplot(pcoa_w_metadata, aes(x=V1, y=V2, color=SampleType)) +
geom_point() +
theme_bw() +
xlab(paste0('PC1 (', round(var_PC1*100, 2), '%)')) +
ylab(paste0('PC2 (', round(var_PC2*100, 2), '%)'))
ggplotly(pcoa_plot)
Hierarchical clustering is a method that builds a tree-like structure (dendrogram) based on pairwise distances between samples. It starts by treating each sample as its own cluster, then iteratively merges the closest clusters until all samples are joined into one. This approach can reveal nested groupings within the data and highlight relationships across multiple levels of similarity.
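Since hclust() is in base R, a minimal example with two obvious groups illustrates the process (toy points, purely illustrative):

```r
# Average-linkage clustering of four toy points (two tight pairs)
m <- matrix(c(0, 0,
              0.2, 0,
              5, 5,
              5.1, 5), ncol = 2, byrow = TRUE,
            dimnames = list(c("a", "b", "c", "d"), NULL))
hc_toy <- hclust(dist(m), method = "average")
cl <- cutree(hc_toy, k = 2)
cl  # a,b fall in one cluster; c,d in the other
```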
There are different agglomeration methods in hierarchical clustering.
Hence, we first try several methods and check which approach best
represents the originally calculated distances (we use bray
again).
Below, we will compare the following methods: single,
complete, average, ward.
par(mfrow = c(2,2))
# Single linkage
otu.bc.sing <- hclust(bray, method = "single")
dend <- as.dendrogram(otu.bc.sing)
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "Single linkage", line = 0.5)
# Complete linkage
otu.bc.comp <- hclust(bray, method = "complete")
dend <- as.dendrogram(otu.bc.comp)
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "Complete linkage", line = 0.5)
# UPGMA
otu.bc.upgma <- hclust(bray, method = "average")
dend <- as.dendrogram(otu.bc.upgma)
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "UPGMA", line = 0.5)
# Ward
otu.bc.ward <- hclust(bray, method = "ward.D2")
dend <- as.dendrogram(otu.bc.ward)
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "Ward", line = 0.5)
We look at the cophenetic correlation to find which approach retains
most of the information contained in the dissimilarity matrix. Based on
the metrics below, we see that average linkage (UPGMA)
yields the highest correlation coefficient.
Single Linkage
otu.bc.sing.coph <- cophenetic(otu.bc.sing)
(cor(bray, otu.bc.sing.coph))
## [1] 0.9620638
Complete Linkage
otu.bc.comp.coph <- cophenetic(otu.bc.comp)
(cor(bray, otu.bc.comp.coph))
## [1] 0.9781153
UPGMA
otu.bc.upgma.coph <- cophenetic(otu.bc.upgma)
(cor(bray, otu.bc.upgma.coph))
## [1] 0.9858692
Ward
otu.bc.ward.coph <- cophenetic(otu.bc.ward)
(cor(bray, otu.bc.ward.coph))
## [1] 0.8696503
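Since cophenetic() is also base R, the idea can be checked on a toy tree: when the tree represents the distances well, the tree-implied (cophenetic) distances correlate strongly with the originals (toy points, purely illustrative):

```r
# Cophenetic correlation for a toy average-linkage tree
m <- matrix(c(0, 0,  0.2, 0,  5, 5,  5.1, 5), ncol = 2, byrow = TRUE)
d <- dist(m)
hc_toy <- hclust(d, method = "average")
cor(d, cophenetic(hc_toy))  # close to 1 for this well-structured toy data
```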
We can also visualize these as Shepard-like diagrams. The plots below essentially convey the same information as the metrics calculated above.
par(mfrow = c(2,2))
plot(
bray,
otu.bc.sing.coph,
xlab = "Bray-Curtis dissimilarity",
ylab = "Cophenetic distance",
asp = 1,
xlim = c(0, 1),
ylim = c(0, 1),
main = c("Single linkage",
paste("Cophenetic correlation =", round(cor(bray,
otu.bc.sing.coph), 3)))
)
abline(0, 1)
lines(lowess(bray, otu.bc.sing.coph),
col = "red")
plot(
bray,
otu.bc.comp.coph,
xlab = "Bray-Curtis dissimilarity",
ylab = "Cophenetic distance",
asp = 1,
xlim = c(0, 1),
ylim = c(0, 1),
main = c("Complete linkage",
paste("Cophenetic correlation =",
round(cor(bray, otu.bc.comp.coph), 3)))
)
abline(0, 1)
lines(lowess(bray, otu.bc.comp.coph),
col = "red")
plot(
bray,
otu.bc.upgma.coph,
xlab = "Bray-Curtis dissimilarity",
ylab = "Cophenetic distance",
asp = 1,
xlim = c(0, 1),
ylim = c(0, 1),
main = c("UPGMA",
paste("Cophenetic correlation =",
round(cor(bray, otu.bc.upgma.coph), 3)))
)
abline(0, 1)
lines(lowess(bray, otu.bc.upgma.coph),
col = "red")
plot(
bray,
otu.bc.ward.coph,
xlab = "Bray-Curtis dissimilarity",
ylab = "Cophenetic distance",
asp = 1,
xlim = c(0, 1),
ylim = c(0, 1),
main = c("Ward",
paste("Cophenetic correlation =",
round(cor(bray, otu.bc.ward.coph), 3)))
)
abline(0, 1)
lines(lowess(bray, otu.bc.ward.coph),
col = "red")
Finally, Gower distances (the sum of squared differences between the original and cophenetic distances) provide another measurable quantity describing how well the cophenetic distances match the original distance matrix; smaller values indicate a better match.
Single Linkage
(gow.dist.single <- sum((bray - otu.bc.sing.coph) ^ 2))
## [1] 1.260377
Complete Linkage
(gow.dist.comp <- sum((bray - otu.bc.comp.coph) ^ 2))
## [1] 0.3708774
UPGMA
(gow.dist.UPGMA <- sum((bray - otu.bc.upgma.coph) ^ 2))
## [1] 0.1835741
Ward
(gow.dist.ward <- sum((bray - otu.bc.ward.coph) ^ 2))
## [1] 128.238
Based on the results above, it seems that UPGMA (or
average linkage clustering) best represents the distance
matrix used. Let’s plot it again for reference.
dend <- as.dendrogram(otu.bc.upgma) # reassign: dend still holds the Ward tree from the last plot
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "UPGMA", line = 0.5)
A dendrogram’s interpretation largely depends on the number of clusters you define. As such, it is crucial that the number of clusters you select results in samples that are cohesive with the other samples in their own cluster and, at the same time, well separated from samples in other clusters. This is what silhouette widths attempt to measure.
First, we select the “best” dendrogram generated above (UPGMA).
hc <- otu.bc.upgma
For each possible number of clusters, we calculate the average silhouette width.
Si <- numeric(nrow(rare_ft))
for (k in 2:(nrow(rare_ft) - 1)) {
sil <- silhouette(cutree(hc, k = k), bray)
Si[k] <- summary(sil)$avg.width
}
(k.best <- which.max(Si))
## [1] 8
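The silhouette() function comes from the cluster package (already loaded above); on tightly separated toy clusters the average width approaches 1 (toy points, purely illustrative):

```r
library(cluster)

# Average silhouette width for a 2-cluster cut of toy data
m <- matrix(c(0, 0,  0.2, 0,  5, 5,  5.1, 5), ncol = 2, byrow = TRUE)
d <- dist(m)
cl <- cutree(hclust(d, method = "average"), k = 2)
summary(silhouette(cl, d))$avg.width  # near 1: cohesive, well-separated clusters
```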
Finally, we plot silhouette widths versus cluster number.
plot(
1:nrow(rare_ft),
Si,
type = "h",
main = "Silhouette-optimal number of clusters",
xlab = "k (number of clusters)",
ylab = "Average silhouette width"
)
axis(
1,
k.best,
paste("optimum", k.best, sep = "\n"),
col = "red",
font = 2,
col.axis = "red"
)
points(k.best,
max(Si),
pch = 16,
col = "red",
cex = 1.5)
We see that 8 is the optimal number of clusters; thus, we “cut” the tree at a
height that yields 8 distinct clusters. In this demo data, samples
primarily cluster according to SampleType, except that
skin/palm and tongue samples cluster together.
dend <- as.dendrogram(otu.bc.upgma) # make sure dend holds the UPGMA tree
dend %>% set("labels_cex", 0.8) %>%
plot(cex = 0.8,
cex.axis = 0.8,
cex.lab = 0.9,
horiz = FALSE)
title(main = "UPGMA", line = 0.5)
abline(a=1.1, b=0, col='red')
Lastly, we can also test whether distances of samples within and between
sample groups are significantly different. Below, we use the R function
adonis2() from vegan. Since we only use a single regressor, the
results are equivalent to a one-way PERMANOVA. Results show a
significant p-value (at \(\alpha=0.05\)).
(adonis_test <- adonis2(bray ~ SampleType, data=metadata))
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 999
##
## adonis2(formula = bray ~ SampleType, data = metadata)
## Df SumOfSqs R2 F Pr(>F)
## SampleType 8 8.2345 0.72787 5.6839 0.001 ***
## Residual 17 3.0786 0.27213
## Total 25 11.3130 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
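To demystify what adonis2() is doing, here is a conceptual base-R sketch of a one-way PERMANOVA (not vegan's implementation; the pseudo_F helper and the toy data are made up for illustration): compute a pseudo-F statistic directly from the distance matrix, then compare it against pseudo-F values obtained after randomly permuting the group labels.

```r
# Pseudo-F for a one-way design, computed directly from a distance matrix
pseudo_F <- function(d, groups) {
  d2 <- as.matrix(d)^2
  N <- length(groups)
  k <- length(unique(groups))
  ss_total <- sum(d2[lower.tri(d2)]) / N            # total sum of squares
  ss_within <- 0                                    # within-group sum of squares
  for (g in unique(groups)) {
    idx <- which(groups == g)
    ss_within <- ss_within + sum(d2[idx, idx][lower.tri(d2[idx, idx])]) / length(idx)
  }
  ((ss_total - ss_within) / (k - 1)) / (ss_within / (N - k))
}

set.seed(1)
pts <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),   # toy group A
             matrix(rnorm(10, mean = 4), ncol = 2))   # toy group B, far away
grp <- rep(c("A", "B"), each = 5)
d <- dist(pts)

obs <- pseudo_F(d, grp)
perm <- replicate(999, pseudo_F(d, sample(grp)))
(p_val <- (sum(perm >= obs) + 1) / (999 + 1))  # small: groups are separated
```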
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.33 plotly_4.10.4 phyloseq_1.42.0 Hmisc_5.1-3
## [5] ggpubr_0.6.0 ggplot2_3.5.1 corrplot_0.92 dplyr_1.1.4
## [9] cluster_2.1.6 dendextend_1.17.1 qiime2R_0.99.6 vegan_2.6-6.1
## [13] lattice_0.22-6 permute_0.9-7
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-165 bitops_1.0-8 httr_1.4.7
## [4] GenomeInfoDb_1.34.9 tools_4.2.2 backports_1.5.0
## [7] bslib_0.8.0 utf8_1.2.4 R6_2.5.1
## [10] rpart_4.1.23 lazyeval_0.2.2 DBI_1.2.3
## [13] BiocGenerics_0.44.0 mgcv_1.9-1 colorspace_2.1-1
## [16] rhdf5filters_1.10.1 ade4_1.7-22 nnet_7.3-19
## [19] withr_3.0.1 NADA_1.6-1.1 tidyselect_1.2.1
## [22] gridExtra_2.3 compiler_4.2.2 cli_3.6.3
## [25] Biobase_2.58.0 htmlTable_2.4.3 labeling_0.4.3
## [28] sass_0.4.9 scales_1.3.0 checkmate_2.3.2
## [31] stringr_1.5.1 digest_0.6.36 foreign_0.8-87
## [34] rmarkdown_2.28 XVector_0.38.0 base64enc_0.1-3
## [37] pkgconfig_2.0.3 htmltools_0.5.8.1 highr_0.11
## [40] fastmap_1.2.0 htmlwidgets_1.6.4 rlang_1.1.4
## [43] rstudioapi_0.16.0 zCompositions_1.5.0-4 farver_2.1.2
## [46] jquerylib_0.1.4 generics_0.1.3 jsonlite_1.8.8
## [49] crosstalk_1.2.1 car_3.1-2 RCurl_1.98-1.14
## [52] magrittr_2.0.3 GenomeInfoDbData_1.2.9 Formula_1.2-5
## [55] biomformat_1.26.0 Matrix_1.5-1 Rcpp_1.0.13
## [58] munsell_0.5.1 S4Vectors_0.36.2 Rhdf5lib_1.20.0
## [61] fansi_1.0.6 abind_1.4-5 ape_5.8
## [64] viridis_0.6.5 lifecycle_1.0.4 stringi_1.8.4
## [67] yaml_2.3.10 carData_3.0-5 MASS_7.3-60
## [70] zlibbioc_1.44.0 rhdf5_2.42.1 plyr_1.8.9
## [73] grid_4.2.2 parallel_4.2.2 crayon_1.5.3
## [76] Biostrings_2.66.0 splines_4.2.2 multtest_2.54.0
## [79] knitr_1.48 pillar_1.9.0 igraph_2.0.3
## [82] ggsignif_0.6.4 reshape2_1.4.4 codetools_0.2-20
## [85] stats4_4.2.2 glue_1.7.0 evaluate_0.24.0
## [88] data.table_1.15.4 vctrs_0.6.5 foreach_1.5.2
## [91] gtable_0.3.5 purrr_1.0.2 tidyr_1.3.1
## [94] cachem_1.1.0 xfun_0.46 broom_1.0.6
## [97] rstatix_0.7.2 survival_3.7-0 viridisLite_0.4.2
## [100] truncnorm_1.0-9 tibble_3.2.1 iterators_1.0.14
## [103] IRanges_2.32.0